A study of n-gram and decision tree letter language modeling methods
Authors
Abstract
The goal of this paper is to investigate various language model smoothing techniques and decision tree based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision tree based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs the best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams.
Corresponding author. E-mail: [email protected].
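The deleted interpolation smoothing favored in the abstract linearly combines the maximum-likelihood estimates of all model orders. The sketch below is a minimal illustration for a letter trigram model; the fixed `lambdas` weights stand in for the held-out-data optimization of the paper's bottom-up algorithm, and all names are our own.

```python
from collections import defaultdict

def train_counts(text, n=3):
    """Collect k-gram counts for k = 0..n from a character string."""
    counts = [defaultdict(int) for _ in range(n + 1)]
    counts[0][()] = len(text)                      # 0-gram "history" count
    for i in range(len(text)):
        for k in range(1, n + 1):
            if i + k <= len(text):
                counts[k][tuple(text[i:i + k])] += 1
    return counts

def interpolated_prob(counts, history, letter, lambdas):
    """Deleted interpolation: weighted sum of ML estimates of orders 1..n.

    lambdas[k] weights the order-(k+1) model and should sum to 1; in the
    paper's scheme these weights are tuned on held-out data rather than fixed.
    """
    p = 0.0
    for k in range(len(lambdas)):
        h = tuple(history[len(history) - k:]) if k else ()
        denom = counts[k][h]                       # count of the k-letter history
        num = counts[k + 1][h + (letter,)]         # count of history + predicted letter
        if denom:
            p += lambdas[k] * num / denom
    return p

# Toy usage on a short string: P(r | "ab") under weights 0.2/0.3/0.5.
counts = train_counts("abracadabra", n=3)
p = interpolated_prob(counts, ("a", "b"), "r", [0.2, 0.3, 0.5])
```

Because each ML component is a proper distribution over letters seen after its history, the interpolated estimate also sums to one over the alphabet.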
Similar articles
Syntactic Decision Tree LMs: Random Selection or Intelligent Design?
Decision trees have been applied to a variety of NLP tasks, including language modeling, for their ability to handle a variety of attributes and sparse context space. Moreover, forests (collections of decision trees) have been shown to substantially outperform individual decision trees. In this work, we investigate methods for combining trees in a forest, as well as methods for diversifying tre...
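One simple way to combine trees in a forest, as described above, is to average their per-history letter distributions. A minimal sketch with an invented `Node` structure (our own, not the paper's): internal nodes ask set-membership questions about a history position, leaves hold letter distributions, and the forest averages uniformly.

```python
class Node:
    """Decision-tree node: either a leaf with a letter distribution (`dist`),
    or an internal node asking whether history[pos] is in `subset`."""
    def __init__(self, pos=None, subset=None, yes=None, no=None, dist=None):
        self.pos, self.subset, self.yes, self.no, self.dist = pos, subset, yes, no, dist

    def predict(self, history):
        if self.dist is not None:                  # leaf: stored distribution
            return self.dist
        branch = self.yes if history[self.pos] in self.subset else self.no
        return branch.predict(history)

def forest_predict(trees, history):
    """Uniformly average the leaf distributions of all trees in the forest."""
    dists = [t.predict(history) for t in trees]
    letters = set().union(*dists)
    return {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in letters}

# Toy forest: one real tree plus a context-free stump.
tree = Node(pos=-1, subset={"q"},
            yes=Node(dist={"u": 1.0}),
            no=Node(dist={"a": 0.5, "e": 0.5}))
stump = Node(dist={"u": 0.2, "a": 0.8})
avg = forest_predict([tree, stump], ("q",))
```

Averaging smooths each tree's hard partition of the history space, which is one reason forests outperform single trees.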
Decision Tree-Based Syntactic Language Modeling
Title of dissertation: DECISION TREE-BASED SYNTACTIC LANGUAGE MODELING Denis Filimonov, Doctor of Philosophy, 2011 Dissertation directed by: Dr. Mary Harper Department of Computer Science Dr. Philip Resnik Department of Linguistics Statistical Language Modeling is an integral part of many natural language processing applications, such as Automatic Speech Recognition (ASR) and Machine Translatio...
Generalized Interpolation in Decision Tree LM
In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbit...
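The interpolation discussed above has the familiar form p(w|h) = λ(h)·p_high(w|h) + (1−λ(h))·p_backoff(w|h′), where h′ is the lower-order history. A small sketch; the Witten-Bell-style recipe for λ is our illustrative choice, not this paper's:

```python
def witten_bell_lambda(history_count, distinct_continuations):
    """One common weight recipe: lam = c(h) / (c(h) + T(h)), where T(h) is
    the number of distinct letters observed after history h. Frequent
    histories trust the higher-order model; rare ones lean on the backoff."""
    denom = history_count + distinct_continuations
    return history_count / denom if denom else 0.0

def interpolated(p_high, p_backoff, lam):
    """p(w|h) = lam * p_high(w|h) + (1 - lam) * p_backoff(w|h')."""
    return lam * p_high + (1 - lam) * p_backoff

# Toy usage: history seen 8 times with 2 distinct continuations.
lam = witten_bell_lambda(8, 2)
p = interpolated(0.5, 0.1, lam)
```

In n-gram models the backoff history h′ is always a suffix of h, which is the relation this paper notes holds trivially there but not in trees with arbitrary history clustering.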
Using Random Forests in the Structured Language Model
In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition...
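"Randomly growing" a decision tree typically means restricting each node's question search to a random candidate subset, so repeated runs yield diverse trees. A sketch of that one step, scored by weighted split entropy; the data layout and helper names are our own invention.

```python
import random
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_of_random_questions(data, questions, k, rng):
    """Pick the best splitter among k randomly sampled candidate questions.

    `data` is a list of (history, next_letter) pairs; a question is a
    predicate over the history. Sampling only k of the questions is what
    makes repeated tree growing produce different trees for a forest.
    """
    candidates = rng.sample(questions, k)
    def split_entropy(q):
        yes = [c for h, c in data if q(h)]
        no = [c for h, c in data if not q(h)]
        n = len(data)
        # A split that sends everything one way is useless: score it worst.
        return (len(yes) * entropy(yes) + len(no) * entropy(no)) / n if yes and no else float("inf")
    return min(candidates, key=split_entropy)

# Toy usage: one-letter histories; questions test the last letter.
data = [(("q",), "u"), (("q",), "u"), (("a",), "b"), (("a",), "c")]
questions = [lambda h: h[-1] == "q", lambda h: h[-1] == "a", lambda h: True]
chosen = best_of_random_questions(data, questions, k=2, rng=random.Random(0))
```

Growing each tree with a fresh random seed (and often a resampled data subset) and averaging their predictions gives the random forest.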
Ranking stocks of listed companies on Tehran stock exchange using a hybrid model of decision tree and logistic regression
Much research has introduced linear or nonlinear models using statistical models and machine learning tools in artificial intelligence to estimate Iran's rate of return. The primary purpose of these methods is to use different independent variables simultaneously to improve the modeling of stock return rates. However, in predicting the rate of return, in addition to the modeling method, the degree of co...
Journal: Speech Communication
Volume 24, Issue: -
Pages: -
Published: 1998